Abstract —Conventional single image based localization methods
usually fail to localize a querying image when there exist large
variations between the querying image and the pre-built scene. To 
address this, we propose an image-set querying based localization 
approach. When the localization by a single image fails to work, 
the system will ask the user to capture more auxiliary images. 
First, a local 3D model is established for the querying image set. 
Then, the pose of the querying image set is estimated by solving 
a nonlinear optimization problem, which aims to match the local 
3D model against the pre-built scene. Experiments have shown 
the effectiveness and feasibility of the proposed approach. 

Index Terms —Image set localization, structure-from-motion, 
camera set pose estimation 

I. Introduction

Image-based localization has been widely used in many
vision applications such as auto navigation, augmented
reality, and photo collection visualization. The aim of
image-based localization is to estimate the camera’s pose
(orientation and position) in an area of interest from a single
querying image. Generally, there are three key steps in a
single image-based localization system [4]: 1) 2D local
features (e.g., SIFT) are extracted from the querying image,
2) the local features of the querying image are matched against
the 3D point cloud of the scene, which also stores the
corresponding feature descriptors, and 3) the camera pose is
estimated by solving a perspective-n-point (PnP) problem.

It is challenging to directly match a querying image to
the 3D point cloud of the scene, especially when there exist
large variations between them. The reason is that the model
of the scene is usually built under a fixed environment
that differs from that of the querying image. For instance,
a scene may be reconstructed from high quality street views, while
the surveillance cameras to be localized work under
different illumination conditions and are distant from the
street view positions, as shown in Fig. 1. Therefore, conventional
single image localization methods [4] fail because they rely
heavily on 2D-3D matches between features, which are rarely
available in these challenging scenarios.

To address this, in this paper, we present an image-set
querying based localization approach. The framework is shown
in Fig. 2. When the pose estimation by the conventional single
image localization method is unsuccessful, the user is asked
to capture more auxiliary images to assist the localization task.
Together with the querying image, these bridging images are
aggregated to form a local 3D model. Then a 3D-to-3D feature
matching scheme is used to obtain reliable matches between
the querying image set and the scene 3D point cloud. The
pose of the image set is estimated by solving a nonlinear
optimization problem. Besides the reconstructed camera
poses in the local camera set coordinate system, local 3D point
information is exploited in the nonlinear optimization stage to
further minimize the re-projection error. Since the image set
not only contains more information for localization, but also
has stronger inherent geometry constraints, better localization
performance can be obtained.

Fig. 1: Illustration of the challenge of the feature matching. The
scene 3D point cloud (yellow dots on the main building) is
reconstructed from street view panoramas (bottom). Surveillance cameras
from West Hallway (red), West Building (blue) and East Hallway
(green) are to be localized. There are only a few feature matches
between the West Hallway image and the scene. No feature matches
can be found in the West Building image and the East Hallway image
due to the large pose variations from the scene.

This paper is organized as follows. Section II introduces a
new camera set pose solver. Section III details the proposed
image set querying based localization approach. Section IV
presents the experimental results, and Section V concludes
this paper.

II. Camera set pose estimation 

A set of pinhole cameras can be considered as a generic
camera, which is represented by a bag of rays. Fig. 3 illustrates
the basic idea of using a set of pinhole cameras to form a
generic camera, where the rays need not come from a
single optical center.

Fig. 2: The framework of the proposed image set querying based
localization. 



Fig. 3: A pinhole camera set forms a generic camera, and an observed
ray $r_i$ of the generic camera comes from center $C_i$ and points at the
global fixed 3D point $X_i^f$. By applying a rigid transformation $T$, the
corresponding 3D point $X_i^f$ in the global coordinate system is mapped
to the local camera set coordinate system as $T X_i^f$ (top right). The rigid
transformation $T$ can be seen as the pose of the camera set and
can be estimated using the proposed DLT solver. With a nonlinear
optimization, the constraints introduced by the 3D points $X_j^l$ in the local
camera set coordinate system are exploited. Relative poses among
the cameras $C_i$ and the local 3D points $X_j^l$ can be adjusted (red)
to further minimize the re-projection error (bottom right).


Similar to single camera pose estimation, the camera
set pose estimation problem can be defined as follows: given some
rays (direction $r_i$ with its projection center $C_i$) and their
corresponding global fixed 3D points $X_i^f$, find the camera set
pose, which is the rigid transformation $T = sR[I \mid -C]$ that maps
the matched 3D points from the global coordinate system to
the camera set’s local coordinate system.

According to the geometry constraints, we have

$$ r_i \times (T X_i^f - C_i) = 0, \tag{1} $$

where $C_i$ is the projection center and $r_i$ is the ray’s direction.

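There are 12 unknown variables and 7 DOFs (6 for pose
and 1 for scale) in the transformation $T$. The direct linear
transformation (DLT) method can be applied to solve the estimation
problem. By rearranging Eqn. (1) and writing the cross product with
the skew-symmetric matrix $[r_i]_\times$, we have the following
equation:

$$ [r_i]_\times T X_i^f = [r_i]_\times C_i. \tag{2} $$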


By applying the Kronecker product property, Eqn. (2) can be
re-written as

$$ \left( (X_i^f)^\top \otimes [r_i]_\times \right) \mathrm{vec}(T) = [r_i]_\times C_i. \tag{3} $$
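
As a rough illustration of Eqn. (3), the constraints of all observations can be stacked into one linear system and solved in the least-squares sense. The sketch below is a minimal NumPy version under stated assumptions: the function names `skew` and `solve_camera_set_dlt` and the array layout are hypothetical choices made for this example, the points are extended to homogeneous coordinates so that the 3x4 matrix $T$ can act on them, and a plain least-squares solve stands in for whatever linear solver the original system uses.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ w equals np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def solve_camera_set_dlt(rays, centers, points):
    """Linear estimate of the 3x4 matrix T from N >= 6 observations.

    rays, centers: (N, 3) ray directions r_i and projection centers C_i,
    both expressed in the local camera set frame.
    points: (N, 3) matched global points X_i^f.
    """
    rows, rhs = [], []
    for r, c, x in zip(rays, centers, points):
        s = skew(r)
        x_h = np.append(x, 1.0).reshape(1, 4)  # homogeneous point as a row
        rows.append(np.kron(x_h, s))           # (X^T kron [r]_x), shape 3 x 12
        rhs.append(s @ c)                      # [r]_x C_i
    a = np.vstack(rows)
    b = np.concatenate(rhs)
    vec_t, *_ = np.linalg.lstsq(a, b, rcond=None)
    return vec_t.reshape(3, 4, order="F")      # undo the column-major vec(T)
```

Note that the right-hand side vanishes when all projection centers coincide at the origin, so the scale of $T$ is only observable when the rays come from distinct centers, which is the camera set case considered here.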

Since each observation $(r_i, C_i, X_i^f)$ provides two independent
constraints, at least 6 points are required to solve
the problem with DLT. Having obtained the transformation
$T$, we need to project it from the 12-DOF space into the 7-DOF
space of valid similarity transformations. In the same way that
$K[R \mid t]$ is decomposed from a camera matrix $P$, the
transformation $T$ is decomposed as

$$ K, [R \mid t] = \mathrm{rq\_decomposition}(T). \tag{4} $$


Then the valid 7-DOF transformation $T_{DLT}$ is obtained as

$$ T_{DLT} = s[R \mid t] = \frac{\mathrm{trace}(K)}{3}[R \mid t], \tag{5} $$

where the scale factor $s$ is the average value of $K$’s diagonal
elements. After that, the Levenberg-Marquardt algorithm is
used to minimize the re-projection error, which is the gold
standard in geometry estimation. Initializing $T$ as $T_{DLT}$,
the optimization objective function is formulated as


$$ T_{LM} = \arg\min_{T} \sum_i \left\| r_i - \frac{P_i (T X_i^f)}{\| P_i (T X_i^f) \|} \right\|^2. \tag{6} $$
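
Before the refinement of Eqn. (6) can start, the linear estimate has to be projected onto a valid similarity transformation as described by Eqns. (4) and (5). The following is a minimal sketch of that projection, assuming NumPy only: the RQ decomposition is built from `np.linalg.qr` with a row-reversal trick, and the names `rq_decomposition` and `project_to_similarity` are hypothetical helpers for this example rather than the authors' code.

```python
import numpy as np

def rq_decomposition(m):
    """Factor a 3x3 matrix as m = k @ r, k upper triangular, r orthogonal."""
    p = np.flipud(np.eye(3))              # row-reversal permutation, p @ p = I
    q_, r_ = np.linalg.qr((p @ m).T)      # QR of the flipped, transposed matrix
    k = p @ r_.T @ p                      # upper triangular factor
    r = p @ q_.T                          # orthogonal factor
    signs = np.sign(np.diag(k))
    signs[signs == 0] = 1.0
    s = np.diag(signs)
    return k @ s, s @ r                   # make the diagonal of k positive

def project_to_similarity(t):
    """Project a 3x4 DLT estimate onto s[R | t], following Eqns. (4)-(5)."""
    k, r = rq_decomposition(t[:, :3])
    trans = np.linalg.solve(k, t[:, 3])   # the t in K [R | t]
    scale = np.trace(k) / 3.0             # average of K's diagonal elements
    t_dlt = scale * np.hstack([r, trans.reshape(3, 1)])
    return t_dlt, scale, r
```

For a noise-free input that already has the form $sR[I \mid -C]$, the factor $K$ reduces to $sI$ and the projection returns the input unchanged.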


The solver and optimization above treat the camera
set as a single rigid object, so the relative poses among the
pinhole cameras cannot be changed. However, if the poses
of these cameras are themselves reconstructed by a 3D reconstruction
algorithm, the relative poses among them may not be accurate.
A further optimization is therefore needed to adjust the inner relative
poses for better re-projection error minimization.

Besides the corresponding global 3D points $X_i^f$, the local 3D
points $X_j^l$ reconstructed in the camera set coordinate system
from the image set are also involved in the optimization
step. The nonlinear optimization then has the following
objective function:


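$$ T_{opt} = \arg\min_{T} \sum_i \left\| r_i - \frac{P_i (T X_i^f)}{\| P_i (T X_i^f) \|} \right\|^2 + \sum_i \sum_j \left\| r_{ij}^l - \frac{P_i X_j^l}{\| P_i X_j^l \|} \right\|^2, \tag{7} $$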


where $r_{ij}^l$ is the observed ray from the i-th camera pointing at the
locally reconstructed j-th 3D point $X_j^l$. After this optimization,
more geometry constraints introduced by $X_j^l$ are exploited
and a better pose estimation can be achieved, as illustrated in
Fig. 3.


III. Image set localization

Compared to the ground area, surrounding buildings usually
carry rich information for localization. For example, if a person
does not know where he is, he will look around to determine his
position. To capture such context information of the scene
for better localization, a 360° panorama is a good choice due
to its rich and compact information about the surroundings. With
the help of existing high quality street view panoramas
at regularly distributed locations, a high quality scene 3D model
can be reconstructed and human effort is
greatly reduced. Furthermore, large scale 3D modeling becomes
possible.

Conventional structure-from-motion methods
assume the rectilinear camera model and minimize
pixel re-projection errors. To get a unified representation of both
panoramic and rectilinear cameras, we use the pinhole camera
model instead. The pinhole camera model considers each 2D
pixel as a ray passing through a single projection center
(optical center), which can be represented as a 3D coordinate
$x(u, v, w)$ lying on the unit sphere in the camera
coordinate system. The calibration functions $x = \kappa(u, K)$ and
$u = \kappa^{-1}(x, K)$ define the mapping between a ray $x(u, v, w)$ and
its corresponding pixel $u(u, v)$. Eqns. (8) and (9) are the calibration
functions for the panoramic, fisheye and rectilinear cameras.


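$$ \begin{aligned} p &= \frac{u - u_c}{f}, \quad t = \frac{v - v_c}{f}, \quad u_c = (u_c, v_c), \\ \kappa(u, (f, u_c)) &= (\cos(t)\sin(p), \sin(t), \cos(t)\cos(p)), \end{aligned} \tag{8} $$

$$ \begin{aligned} p &= \arctan\left(\frac{u - u_c}{f}\right), \quad t = \arctan\left(\frac{v - v_c}{f}\right), \\ \kappa(u, (f, u_c)) &= (\cos(t)\sin(p), \sin(t), \cos(t)\cos(p)), \end{aligned} \tag{9} $$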


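where $u_c$ is the principal point’s pixel coordinate (for a
panorama, any point can theoretically be the principal point), $f$
is the focal length, $p$ is the panning angle around the y axis, and $t$
is the tilting angle around the x axis. The geometry re-projection ray
error becomes

$$ \left\| \kappa(u_{ij}, K_i) - \frac{P_i \tilde{X}_j}{\| P_i \tilde{X}_j \|} \right\|, \tag{10} $$

where $u_{ij}$ is the pixel observing the j-th 3D point in the i-th image,
$\tilde{X}_j = (X_j; 1)$ is the homogeneous coordinate of the j-th
3D point, $P_i = [R \mid t]$ is the i-th camera projection matrix, and the
elements in $K_i$ represent the intrinsic parameters of the i-th
camera.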
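
To make the ray-based model more concrete, the sketch below maps pixels to unit rays with the calibration functions of Eqns. (8) and (9) and evaluates a ray re-projection error in the spirit of Eqn. (10). It is an illustrative NumPy sketch under assumptions: the function names are hypothetical, the intrinsics are reduced to a focal length `f` and a principal point `(u_c, v_c)`, and for a panorama `f` is assumed to act as a pixels-per-radian factor.

```python
import numpy as np

def angles_to_ray(p, t):
    """Unit ray for panning angle p (around y) and tilting angle t (around x)."""
    return np.array([np.cos(t) * np.sin(p), np.sin(t), np.cos(t) * np.cos(p)])

def pixel_to_ray_panorama(u, v, f, u_c, v_c):
    """Eqn. (8): the angles grow linearly with the pixel offset."""
    return angles_to_ray((u - u_c) / f, (v - v_c) / f)

def pixel_to_ray_rectilinear(u, v, f, u_c, v_c):
    """Eqn. (9): the angles are the arctangent of the normalized offset."""
    return angles_to_ray(np.arctan((u - u_c) / f), np.arctan((v - v_c) / f))

def ray_error(x_obs, pose, point):
    """Distance between an observed unit ray and the predicted unit ray.

    pose: 3x4 matrix [R | t] of the camera; point: 3D point (3-vector).
    """
    pred = pose @ np.append(point, 1.0)   # point in the camera frame
    return np.linalg.norm(x_obs - pred / np.linalg.norm(pred))
```

Because both camera types produce unit rays, the same `ray_error` residual can be used for panoramic and rectilinear images inside one reconstruction.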

According to the geometry properties of the pinhole camera
model, the building blocks of conventional rectilinear structure-from-motion,
such as two view geometry, triangulation,
perspective-n-point and bundle adjustment, should be
adjusted. By applying the pinhole model based structure-from-motion
procedure, the 3D scene point cloud and cameras can be
reconstructed, as shown in Fig. Each 3D point corresponds
to several 2D features (SIFT), and these 2D features are
indexed by a kd-tree for accelerating the online feature
matching stage.

When the querying image comes, we first apply the conventional
single image based localization technique. If the pose
estimation fails, the querying image has large
variations compared to the pre-built scene. In this case,
additional images are required to help matching between the
querying image and the scene 3D point cloud. These images
can be captured in the area between the target camera and the
scene as a bridge. Together with the querying image, these
bridging images are aggregated to form a local 3D model
(3D point cloud and cameras) by the previous structure-from-motion
procedure, where the inherent geometry constraints
are enhanced. Based on a 3D-to-3D feature matching stage,
the image set 3D model is matched against the scene 3D
point cloud. Finally, the pose of the querying image set model is
estimated by using the camera set pose solver described in
Section II, and the target camera’s pose can be extracted as
shown in Fig.
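
The extraction of the target camera's pose from the estimated camera set pose can be sketched as follows. This is an illustration under assumptions rather than the authors' procedure: the camera set pose is taken as a 3x4 similarity `t_set` $= sR[I \mid -C]$ mapping global to local coordinates, the target camera is described in the local frame by a rotation `r_cam` and translation `t_cam`, and the function name is hypothetical.

```python
import numpy as np

def camera_in_global_frame(t_set, r_cam, t_cam):
    """Global orientation and center of one camera of the image set.

    t_set: 3x4 similarity s[R | -RC] mapping global points to the local frame.
    r_cam, t_cam: pose of the camera in the local frame, x_cam = r_cam @ x_local + t_cam.
    """
    m, m_t = t_set[:, :3], t_set[:, 3]
    scale = np.cbrt(np.linalg.det(m))        # det(sR) = s^3 for a rotation R
    r_set = m / scale
    center_local = -r_cam.T @ t_cam          # camera center in the local frame
    center_global = np.linalg.solve(m, center_local - m_t)
    r_global = r_cam @ r_set                 # rotation from global to camera frame
    return r_global, center_global
```

The camera center is first computed in the local frame and then mapped back through the inverse of the similarity, so the result is expressed in the metric units of the pre-built scene.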

The 3D-to-3D matching stage works as follows. First, the two nearest
neighbors in the 3D point cloud of the scene are identified for each local 3D
point in the image set. Then, the ratio of the
distance to the nearest neighbor over the distance to the
second nearest neighbor is tested. Finally, the ratio
test is applied in the reverse direction to filter out bad local 3D points and
get enough reliable 3D-to-3D feature matches.
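
The matching steps above can be sketched in a few lines of NumPy. This is a simplified illustration under several assumptions: each 3D point is represented by a single descriptor vector (the scene model actually stores several descriptors per point), nearest neighbors are found by brute force instead of a kd-tree, and the reverse test additionally requires that the scene point selects the same local point, which is one possible reading of the reverse ratio test. The function names are hypothetical.

```python
import numpy as np

def two_nearest(query, base):
    """For each query row: index of the nearest base row and the two smallest distances."""
    d = np.linalg.norm(query[:, None, :] - base[None, :, :], axis=2)
    order = np.argsort(d, axis=1)[:, :2]
    rows = np.arange(len(query))
    return order[:, 0], d[rows, order[:, 0]], d[rows, order[:, 1]]

def ratio_test_matches(local_desc, scene_desc, threshold=0.6):
    """Bidirectional ratio test between local and scene point descriptors.

    local_desc: (N, D) descriptors of local 3D points, N >= 2.
    scene_desc: (M, D) descriptors of scene 3D points, M >= 2.
    Returns a list of (local_index, scene_index) pairs.
    """
    fwd_idx, fwd_d1, fwd_d2 = two_nearest(local_desc, scene_desc)
    bwd_idx, bwd_d1, bwd_d2 = two_nearest(scene_desc, local_desc)
    matches = []
    for i, j in enumerate(fwd_idx):
        forward_ok = fwd_d1[i] < threshold * fwd_d2[i]
        reverse_ok = bwd_idx[j] == i and bwd_d1[j] < threshold * bwd_d2[j]
        if forward_ok and reverse_ok:
            matches.append((i, int(j)))
    return matches
```

The default `threshold=0.6` mirrors the ratio test threshold reported in the experiments; the pairwise distance matrix makes this version suitable only for small point sets.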


IV. Experiments 

We build a scene 3D point cloud for the Main Building
of Tsinghua University, which spans hundreds of meters and
consists of 23 street view panoramas, 3067 3D points and
14330 feature descriptors. The 3D-to-3D ratio test threshold
is set to 0.6 and the scene 3D point cloud kd-tree is built by
FLANN with 95% accuracy. Three image sets are tested:
West Hallway (14 images), East Hallway (15 images) and
West Building (21 images). The querying image is distant from
the scene. Experiments are conducted on the
conventional single image based method [4] and the proposed
camera set pose estimation without and with nonlinear
optimization (camset, camset+opt). To evaluate the accuracy of the
localization, a whole 3D model reconstructed
by using all images is taken as the ground truth, and all the
localization results are further checked manually. The minimal number of
2D/3D matched inliers for conventional single image based
localization is 12 (same as [4]), below which pose estimation
is regarded as failed.

The conventional single image based localization method [4]
cannot estimate the poses of the cameras in most cases because
the environment in which the images are taken differs from that of the
pre-built scene. The proposed methods locate each image
in the querying image set successfully, as shown in Tab. I.

The qualitative localization results are shown in Fig. 4 and
the quantitative results are listed in Tab. II, from which we
can see that 1) the pose of the querying image is successfully extracted by
the proposed framework, and the orientation errors are usually
very small (less than 4°), 2) only a few images can be localized
by the conventional single image localization method, while
all of them can be successfully localized by the proposed
approach, and 3) the location error of the proposed methods is
generally smaller than that of the conventional single image based method,
and the performance improves when the nonlinear
optimization is further applied.








TABLE I: Successful registration rate in the three querying image sets.

| method | West Hallway #reg./#total | West Hallway reg. rate | East Hallway #reg./#total | East Hallway reg. rate | West Building #reg./#total | West Building reg. rate |
|---|---|---|---|---|---|---|
| single image based [4] | 4/14 | 28.57% | 12/15 | 80.00% | 1/21 | 4.76% |
| proposed | 14/14 | 100.00% | 15/15 | 100.00% | 21/21 | 100.00% |


TABLE II: Evaluation of the location error on image set and target image. The statistical results for the single querying image based method
[4] are from successfully registered images only.

| dataset | method | #reg./#total | image set min (m/deg) | image set median (m/deg) | image set max (m/deg) | image set mean (m/deg) | target image recon. err (m/deg) |
|---|---|---|---|---|---|---|---|
| West Hallway | single image based [4] | 4/14 | 2.081/1.001 | 2.708/1.019 | 5.029/1.092 | 3.131/1.033 | 5.028/1.092 |
| West Hallway | camset | 14/14 | 2.175/0.949 | 3.031/0.999 | 3.973/1.546 | 3.030/1.042 | 3.973/1.092 |
| West Hallway | camset+opt | 14/14 | 1.770/0.915 | 2.455/0.970 | 3.248/1.546 | 2.457/1.042 | 3.248/1.095 |
| East Hallway | single image based [4] | 12/15 | 1.015/0.335 | 3.672/1.278 | 23.014/7.815 | 8.418/2.827 | - |
| East Hallway | camset | 15/15 | 2.833/1.116 | 3.297/1.129 | 3.697/1.154 | 3.293/1.131 | 3.641/1.154 |
| East Hallway | camset+opt | 15/15 | 2.489/0.980 | 2.908/0.991 | 3.273/1.104 | 2.904/0.991 | 3.273/1.001 |
| West Building | single image based [4] | 1/21 | 4.729/3.847 | 4.729/3.847 | 4.729/3.847 | 4.729/3.847 | - |
| West Building | camset | 21/21 | 2.167/2.402 | 4.516/3.334 | 4.889/3.658 | 4.247/3.320 | 4.664/3.414 |
| West Building | camset+opt | 21/21 | 1.638/2.418 | 4.463/2.925 | 4.963/3.347 | 4.197/2.960 | 4.487/2.979 |



Fig. 4: Illustration of localization results. Top left shows the
reconstructed scene in the main building dataset. Top middle is the
reconstructed ground truth scene using all three image sets. Top
right is the final localization result. The bottom two rows show details of
the West Hallway, West Building and East Hallway image sets with
different methods: single image based [4] (blue), camset (yellow) and
camset+opt (red). Images in the bottom row are enlarged from the
images in the top row. We link each ground truth camera center with the corresponding
estimated camera center to visualize the displacement.


V. Conclusion 

In this paper, we have proposed a new framework to 
solve the problem of image-based localization by querying a 
bridging image set rather than a single image. Compared with 
the single image, the image set not only contains more information
for localization with more feature matches, but also
has stronger inherent geometry constraints enforced by a local 
reconstruction. Therefore, it can be employed for improved 
localization performance. Experimental results have shown the 
effectiveness and feasibility of the proposed approach. In the 
future, we will study the way to capture the bridging image 
set efficiently to further improve the efficiency of the proposed 
approach. 